Main Research Goal

The variable that interests us most is quality, since we want to understand which chemical properties influence the quality of red wines.

We load the red wine quality data set. This dataset is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine and a quality score. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

The variables of the dataset, as loaded, are listed below:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

We can see a variable X, which holds the index of each record in the dataset. We definitely want to remove X before we move forward.
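One way to drop the index column in R is to assign NULL to it. The following is a minimal sketch rather than the exact code behind this report: it assumes the data frame is called red (the name that appears in later output) and, to stay self-contained, rebuilds only a few columns from the first five records shown above instead of loading the full file.

```r
# Toy data frame built from the first five records shown in the str() output.
# Only a subset of the columns is reproduced here; the real data frame would
# be read from the full data file instead.
red <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

# Drop the index column by assigning NULL to it.
red$X <- NULL

# Inspect the structure again to confirm X is gone.
str(red)
```

Applied to the full data, the same assignment leaves the 12-variable structure shown next.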

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

With X removed, we can continue to explore the variables in the dataset.

Attribute Measure Units of Each Variable

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

  1. Fixed acidity (tartaric acid - \(g / dm^3\))
  2. Volatile acidity (acetic acid - \(g / dm^3\))
  3. Citric acid (\(g / dm^3\))
  4. Residual sugar (\(g / dm^3\))
  5. Chlorides (sodium chloride - \(g / dm^3\))
  6. Free sulfur dioxide (\(mg / dm^3\))
  7. Total sulfur dioxide (\(mg / dm^3\))
  8. Density (\(g / cm^3\))
  9. pH
  10. Sulphates (potassium sulphate - \(g / dm^3\))
  11. Alcohol (% by volume)

Output variable (based on sensory data):

  1. Quality (score between 0 and 10)

Note: There are no missing attribute values in this dataset.

Description of Attributes

  1. Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

  2. Volatile acidity: the amount of acetic acid in wine, which at levels that are too high can lead to an unpleasant, vinegar taste

  3. Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

  4. Residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

  5. Chlorides: the amount of salt in the wine

  6. Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

  7. Total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

  8. Density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

  10. Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

  11. Alcohol: the percent alcohol content of the wine

Output variable (based on sensory data):

  1. Quality: the score of a red wine, between 0 (lowest) and 10 (highest)

Univariate Plots Section

The previous sections gave an overview of the dataset. Here we start with a summary of each variable.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Let’s start with the quality summary. The lowest quality among the red wines is 3 and the highest is 8. This tells us there are neither very bad nor very excellent wines in this dataset. We also want to make sure our quality variable is actually categorical (in R, it needs to be a factor).

##  Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

We are sure the quality variable is categorical and we can continue exploring it more in detail.
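The conversion can be sketched as follows. This is an illustration under stated assumptions, not the report's own code: it uses only the first ten quality scores shown in the earlier str() output, and it fixes the factor levels to 3 through 8 (the range observed in the full data) so that empty levels still appear in the counts. The variable names quality, quality_f and counts are made up for this example.

```r
# First ten quality scores, copied from the str() output earlier in the report.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Convert to a factor with explicit levels 3..8 (assumed from the observed range).
quality_f <- factor(quality, levels = 3:8)

# Check the type and count the wines per quality level.
str(quality_f)
counts <- table(quality_f)
print(counts)
```

Because quality grades have a natural order, an ordered factor (factor(..., ordered = TRUE)) could also be a reasonable choice; the sketch keeps a plain factor to match the output above.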

From this plot we can tell that quality 5 is the most frequent, closely followed by quality 6. On the other hand, 3 and 8 are the least frequent. This will help us understand why medium quality wines dominate our later plots.

There is one variable that stands out at a quick glance in the summary. The density seems to have a very tiny difference between minimum, median and maximum values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Just as expected, the density is mainly between 0.995 and 1 \(g/cm^3\), which suggests that all wines have similar density.

Once we explored quality and density it might be good to look at the other variables.

From this plot, we can see that density is plotted just as it was before, which confirms that our earlier plot was correct.

We can also go further and explore a couple of variables in more detail.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity peaks in the range 7 to 9 \(g/dm^3\). Since we know that the most frequent qualities are 5 and 6, wines with fixed acidity between 7 and 9 may mostly belong to qualities 5 and 6. Following the same intuition, the least frequent values in the histogram may belong to either higher or lower quality wines. In the next section, we will need to investigate this further using two variables.

Now, we can check the volatile acidity.

## 
## 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.6 
##   3  54 193 350 281 370 190  85  37  26   4   3   2   1

If we use fine bins for the volatile.acidity histogram, we can see two or three peaks at 0.4, 0.5 and 0.6. The two highest peaks, at 0.4 and 0.6, might be related to the most frequent wine qualities, so volatile acidity around these peaks may mainly correspond to quality 5-6. To confirm this we will need a more elaborate plot with two variables.

For the time being, we can continue to explore other variables such as citric.acid.

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

From a fine histogram, citric.acid does not seem to be a clear contributing factor to red wine quality. However, the fact that there is more than one peak makes us wonder whether each peak is related to a group of qualities.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Surprisingly, most wines have low residual.sugar, and it could be that good quality is associated with extremely low or high residual sugar. This might be a good variable to help us distinguish low quality from good quality wines.

Finally, let’s check another right-skewed distribution (long right tail) that, according to its name, seems to be related to wine quality. I am referring to alcohol.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

This histogram seems to have a peak between 9.5 and 10.5, but very low counts between 8.4 and 9.5. The same happens at the other end: the higher the alcohol level, the smaller the bin height. This can be an indication that alcohol is associated with quality, as both low and high quality wines are few.

Note: We will need to explore more, but it seems skewed distributions might be related to quality.


Univariate Analysis

This univariate analysis was the first step of the exploration and a way to get familiar with the data. We drew histograms to understand the distributions of the features and to find the most frequent quality grades of red wines in the dataset.

What is the structure of your dataset?

The dataset is tidy and contains 1,599 red wine records with 12 variables: 11 chemical properties of the wine and a quality score. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). However, the quality of the wines in the dataset only ranges from 3 to 8, since there are no wines graded 0, 1, 2, 9 or 10 in our data.

What is/are the main feature(s) of interest in your dataset?

Quality is the main variable of interest. Our goal is to figure out which properties contribute to the quality of a wine.

From our exploration, I could tell that most wines have a quality grade of 5 or 6. Using some intuition at this point, we might treat the variables with skewed histograms as features of interest, since there is more information on grades 5 and 6 than on lower and higher red wine qualities.

This idea also applies to distributions that seem bimodal or have more than one peak, such as citric acid. In my opinion, these distributions might also hide a pattern related to quality, and that might be the reason for having more than one mode.

Thus, after our histograms, some variables that seem promising when understanding quality are:

  • Citric acid
  • Fixed acidity
  • Free sulfur dioxide
  • Volatile acidity
  • Total sulfur dioxide
  • Alcohol
  • Residual sugar
  • Chlorides
  • Sulphates

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I also explored the density. It varies only slightly between wines, but the interesting part will be to explore whether the tails of the distribution contain wines of all qualities or whether they are related to lower and higher quality wines.

Did you create any new variables from existing variables in the dataset?

No, there was no need to create a new variable so far.

Of the features investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

In the histograms, we found:

Normal or very close to normal distributions

  • pH
  • Density

Right-skewed distributions (long right tail)

  • Total Sulfur Dioxide
  • Alcohol
  • Fixed Acidity
  • Free Sulfur Dioxide
  • Volatile Acidity
  • Chlorides
  • Sulphates
  • Residual sugar

Bimodal or not normal distributions

  • Citric Acid

It seems that the non-normal distributions might be more related to quality, since we have a higher number of quality 5 and 6 wines. This can mean that the tails explain the lowest and highest quality wines in the dataset.

Luckily, this dataset was already tidy, and the only variable that needed a type change was quality. We converted it to a factor so that it is a true categorical variable rather than a numerical one.


Bivariate Plots Section

In the previous section, we used our intuition to choose some relevant variables, so it is a good idea to find the correlation between all variables in the dataset to narrow our exploration and remove possible collinearities.

To start our bivariate analysis, we check which variables might be correlated with each other. For this we use Pearson’s correlation coefficient.
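A minimal sketch of this step uses base R's cor(). It is an illustration only: to stay self-contained it builds a toy data frame (here called red_head, a made-up name) from the first ten records of three variables shown in the earlier str() output, so its numbers will not match the full matrix.

```r
# First ten records of three variables, copied from the str() output
# earlier in the report. The full analysis would use all chemical variables.
red_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix, rounded to two decimals.
cor_matrix <- round(cor(red_head, method = "pearson"), 2)
print(cor_matrix)
```

The matrix reported next covers all eleven chemical variables of the full dataset, with quality left out because it is categorical.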

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00

Since data is hard to visualize from the correlation matrix, we will plot it.

From the plot we can easily see what variables are related and choose which ones to analyze:

Chlorides and sulphates are correlated, so choosing only one of them is a good way to start our analysis. Thus, I will add sulphates to our list.

The same happens for fixed acidity and citric acid. However, citric acid is a variable with a strange distribution, so we had better keep it to explore how it relates to quality.

Citric acid also seems correlated with volatile acidity. However, volatile acidity is only weakly correlated with fixed acidity. It might be good to keep volatile acidity and explore how it contributes to quality.

Total sulfur dioxide and free sulfur dioxide seem correlated with each other, but not with anything else. Thus, I will choose total sulfur dioxide to investigate.

We have already decided to keep citric acid, so we can also keep alcohol even though it shows some correlation with other variables. This is mainly to test whether the alcohol level actually impacts a wine's quality.

Note: Wine contains alcohol, so my curiosity drives me to understand if the alcohol level matters.

Finally, from our promising features, we have residual sugar, which seems not to be correlated with anything. It also had a slightly right-skewed distribution, so it is worth checking whether it affects quality or is neutral.

Besides these features, we had also decided to find out whether density plays a part in quality, so we will explore it as well.

Now, we can generate plots of the chosen variables to find insights about their relationship to quality.

From the plot, we can see that alcohol and density show the slight correlation we expected from before. The same applies to volatile acidity and citric acid. Apart from those pairs, everything else seems to have almost no correlation, which is a good sign when trying to identify the independent variables that relate to quality.

Note: Density is the variable involved in most of the slight correlations in the plot, so if we find density to be not very important, we can remove it. It also has a normal distribution, which may be a reason for its slight correlation with other variables.

We can now explore the relations between quality and the other variables. Let’s start with density, since we noticed most wines have similar densities.

Density

There seems to be no correlation between density and quality. However, because the distribution seems symmetric, I applied a transformation to the density and plotted it again.

After taking the log of the density and then the absolute value of the result, we can confirm that higher quality wines seem to have a few high/low density outliers, while low quality wines have none. However, most wines have a density between 0.995 and 1.0 \(g/cm^3\). This feature seems to contribute slightly to quality.
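The transformation can be sketched as below. For values close to 1, log(d) is approximately d - 1, so abs(log(d)) behaves like the distance of the density from 1 and folds low and high densities onto the same side. The sketch uses only the five summary values of density reported earlier (minimum, quartiles, median, maximum); the names density, density_dev and approx_dev are made up for this example.

```r
# Density summary values reported earlier (Min, 1st Qu., Median, 3rd Qu., Max).
density <- c(0.9901, 0.9956, 0.9968, 0.9978, 1.0037)

# The transformation described in the text: absolute value of the log.
density_dev <- abs(log(density))

# Near 1, log(d) is approximately d - 1, so this is close to |d - 1|.
approx_dev <- abs(density - 1)

print(round(cbind(density, density_dev, approx_dev), 5))
```

Note that this folds the values around a density of 1, not around the median of the data (0.9968), so densities below the median are stretched more than those above it.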

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

The box plot shows that there is not much difference in density between wines, which is what we noticed before in the histogram. However, the scatter plot and statistics show a slight difference for wines of quality 7 and 8: the higher quality wines tend to have slightly lower density on average.

Alcohol

Another variable that caught my attention, simply because of its name, is alcohol. Wine is an alcoholic drink, so we should check the relation between alcohol and quality.

From the plot, it seems that the higher the alcohol level, the higher the quality a wine tends to have.
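The per-quality summaries in this section have the layout produced by base R's by() function. The following sketch shows the idea on a toy data frame (red_head, a made-up name) built from the first ten alcohol and quality values shown in the earlier str() output, so it covers only qualities 5, 6 and 7 and does not reproduce the full results.

```r
# First ten alcohol and quality values, copied from the str() output
# earlier in the report.
red_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# summary() of alcohol within each quality level.
print(by(red_head$alcohol, red_head$quality, summary))

# Group medians as a named vector, convenient for comparing levels.
alcohol_medians <- tapply(red_head$alcohol, red_head$quality, median)
print(alcohol_medians)
```

The summaries for the full dataset follow.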

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

The boxplot and statistics confirm that, from quality 5 upward, the average amount of alcohol increases as quality increases, while below quality 5 it fluctuates without a clear trend.

Volatile Acidity

Boxplots are easier to read here because quality is a categorical variable, so we will continue the analysis with boxplots. The next variable is volatile acidity.

Let’s do some math to complement the boxplots.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Interestingly enough, the median volatile acidity decreases steadily as the wine quality grade increases from 3 to 7, and then stays level between 7 and 8.
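This trend can be checked directly from the medians reported in the summaries above. The sketch below is only a sanity check on those six numbers, with variable names chosen for this example.

```r
# Median volatile acidity per quality level, copied from the summaries above.
va_median <- c("3" = 0.845, "4" = 0.67, "5" = 0.58,
               "6" = 0.49, "7" = 0.37, "8" = 0.37)

# Change in the median from one quality level to the next.
steps <- diff(va_median)
print(steps)

# TRUE if the median never goes up as quality increases.
non_increasing <- all(steps <= 0)

# TRUE only if the median goes down at every single step.
strictly_decreasing <- all(steps < 0)

print(c(non_increasing = non_increasing,
        strictly_decreasing = strictly_decreasing))
```

The medians never increase with quality, but the step from quality 7 to 8 is flat, so the decrease is not strict.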

At this point, alcohol and volatile acidity both seem to be related to wine quality.

Sulphates

Another variable to analyze is sulphates.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

The sulphates also seem to be related to quality, as their mean and median increase with the quality of a red wine. However, the median and mean values of the higher qualities also fall within the range between the third quartile and the maximum of lower quality wines. This suggests that there must be another variable that helps define quality. In fact, we already have alcohol and volatile acidity as possible variables to help define quality, and now we add sulphates to the list.

Citric Acid

Since alcohol and volatile acidity are related to wine quality, it will be interesting to check citric acid, which is correlated with volatile acidity according to the correlation matrix.

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

We noticed that the average citric acid increases with quality, which is the opposite of volatile acidity. In other words, the more citric acid, the higher the quality of a wine tends to be. This is consistent with the negative correlation between citric acid and volatile acidity.

Total Sulfur Dioxide

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

The total sulfur dioxide does not have a clear relationship to the quality of a wine: its average value peaks at quality level 5 and decreases as quality moves away from level 5 in either direction.
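A quick way to see the lack of a monotonic trend is to look at the group means reported above. The sketch below works only on those six means, with names chosen for this example; a rank correlation over six group means is a rough indication, not a formal test.

```r
# Mean total sulfur dioxide per quality level, copied from the summaries above.
quality_level <- 3:8
tsd_mean <- c(24.9, 36.25, 56.51, 40.87, 35.02, 33.44)

# Quality level at which the mean is highest.
peak_level <- quality_level[which.max(tsd_mean)]
print(peak_level)

# Spearman rank correlation between quality level and the group means.
# A value near 0 indicates no monotonic relationship.
rho <- cor(quality_level, tsd_mean, method = "spearman")
print(round(rho, 3))
```

The peak at level 5 and a rank correlation close to zero agree with the reading of the boxplot.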

Residual Sugar

Finally, we can analyze the residual sugar and how it is related to quality. Some wines are considered sweeter than others. Could this also be related to quality?

## red$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## red$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## red$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## red$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## red$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## red$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

Apparently the residual sugar of a wine does not define its quality, since its average values oscillate from quality level 3 to 8. Maybe that explains why sweeter wines can also have high quality.

Once we have analyzed our desired variables, we can make some conclusions out of the plots.


Bivariate Analysis

From the variables we chose, we found that 4 or 5 of them seem to contribute to the quality of a red wine. I would rank them in the following order (1 being the most noticeable contribution):

  1. Volatile acidity
  2. Sulphates
  3. Citric acid
  4. Alcohol
  5. Density

How did the feature(s) of interest vary with other features in the dataset?

In this bivariate analysis, we computed the correlation matrix between our variables. We left quality out of the matrix as it is a categorical variable.

Based on the correlations in the matrix, we reduced the number of promising features to explore.

Note: pH was correlated with almost all our variables.

Something we mentioned before is that the features of interest had either a right-skewed distribution or an irregular one. This was also seen in the boxplots of each feature vs quality. Moreover, with the correlation matrix we narrowed the features of interest even further.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

We had decided to review the density since we thought it might help to define the highest and lowest quality wines. It was in fact confirmed that density contributes slightly to quality: the higher the quality of a wine, the slightly lower its density. This was first shown with a scatter plot and then better appreciated with a box plot (which in practice is better in these cases, as I compare a categorical variable with a numerical variable).

What was the strongest relationship you found?

In order to determine the strongest relationship, we used box plots for all the features of interest selected after the correlation matrix. The analysis was as follows:

Among the more promising features, we started with alcohol vs. quality. Since wine is an alcoholic drink, alcohol might affect its taste considerably. We found that, indeed, the higher the level of alcohol, the higher the quality of the red wine.

Then we checked volatile acidity and found that it steadily decreases as the quality grade of a wine gets higher.

From there, we did a box plot of sulphates, which showed a trend related to quality: sulphates get higher for higher quality wines.

Citric acid was our next variable to analyze, mainly because it had a completely different distribution compared to the other features. Something interesting about citric acid is that it was correlated with volatile acidity. Since both turned out to be related, we were not surprised to find that, for wines of quality 5 to 8, the more citric acid, the higher the quality.

Finally, the total sulfur dioxide and the residual sugar didn’t show a clear correlation to the quality and no trends were noticed.

If we sort these features of interest, together with our additional variable (density), we end up with the following list, ordered from the strongest to the weakest relationship with quality:

  1. Volatile acidity - Negative. Lower volatile acidity, higher quality
  2. Sulphates - Positive. Higher sulphates, higher quality
  3. Citric acid - Positive. Higher citric acid, higher quality. [Negative correlation with Volatile acidity]
  4. Alcohol - Positive. Higher alcohol level, higher quality
  5. Density - Negative. Lower density, higher quality. [Negative correlation with Alcohol]

However, we can notice that two of these variables, citric acid and density, are correlated with volatile acidity and alcohol, respectively. This might mean we need to drop them. In any case, these two variables were kept only out of curiosity and for exploration.


Multivariate Plots Section

We start by plotting the variables that showed a trend in relation to quality.

Let’s divide this plot into one plot per quality category.

We can see from the plots that there are some areas where the quality levels get clustered. Let’s check the volatile acidity and sulphates counts of wines per quality.

Starting with volatile acidity

## round(red$volatile.acidity, 1): 0.1
## 
## 3 4 5 6 7 8 
## 0 0 0 0 3 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.2
## 
##  3  4  5  6  7  8 
##  0  1  8 30 15  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.3
## 
##  3  4  5  6  7  8 
##  0  1 37 82 68  5 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.4
## 
##   3   4   5   6   7   8 
##   1   4 125 158  54   8 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.5
## 
##   3   4   5   6   7   8 
##   0   8 109 139  23   2 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.6
## 
##   3   4   5   6   7   8 
##   2  11 202 130  23   2 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.7
## 
##   3   4   5   6   7   8 
##   0   7 120  55   8   0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.8
## 
##  3  4  5  6  7  8 
##  2  7 42 29  4  1 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 0.9
## 
##  3  4  5  6  7  8 
##  1  6 21  8  1  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 1
## 
##  3  4  5  6  7  8 
##  2  5 12  7  0  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 1.1
## 
## 3 4 5 6 7 8 
## 0 3 1 0 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 1.2
## 
## 3 4 5 6 7 8 
## 1 0 2 0 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 1.3
## 
## 3 4 5 6 7 8 
## 0 0 2 0 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity, 1): 1.6
## 
## 3 4 5 6 7 8 
## 1 0 0 0 0 0
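
The listing above appears to come from grouping the quality counts by rounded volatile acidity. A more compact view of the same kind of information is a two-way table; the sketch below assumes a made-up `toy` data frame with invented values.

```r
# Toy stand-in for the wine data; values are invented for illustration only.
toy <- data.frame(
  volatile.acidity = c(0.28, 0.31, 0.36, 0.42, 0.44, 0.58, 0.61, 0.88),
  quality          = factor(c(7, 8, 6, 8, 6, 5, 5, 4), levels = 3:8)
)

# Rows: volatile acidity rounded to one decimal; columns: quality level.
# Declaring all six levels keeps empty quality columns in the table.
counts <- table(
  volatile.acidity = round(toy$volatile.acidity, 1),
  quality          = toy$quality
)
print(counts)
```

With every bin on its own row, the bins holding most of the high quality wines can be read off a single column.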

Then, we do sulphates counts

## round(red$sulphates, 1): 0.3
## 
## 3 4 5 6 7 8 
## 0 1 0 0 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.4
## 
##  3  4  5  6  7  8 
##  1  4 38  8  2  0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.5
## 
##   3   4   5   6   7   8 
##   4  17 202  86   7   0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.6
## 
##   3   4   5   6   7   8 
##   4  24 264 255  44   3 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.7
## 
##   3   4   5   6   7   8 
##   0   3  89 126  50   7 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.8
## 
##   3   4   5   6   7   8 
##   0   0  40 100  65   5 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.9
## 
##  3  4  5  6  7  8 
##  1  1 14 36 21  2 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1
## 
##  3  4  5  6  7  8 
##  0  0  7 13  5  0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.1
## 
##  3  4  5  6  7  8 
##  0  2 10  5  4  1 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.2
## 
## 3 4 5 6 7 8 
## 0 0 8 3 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.3
## 
## 3 4 5 6 7 8 
## 0 0 5 1 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.4
## 
## 3 4 5 6 7 8 
## 0 0 0 2 1 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.6
## 
## 3 4 5 6 7 8 
## 0 0 3 1 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 2
## 
## 3 4 5 6 7 8 
## 0 1 1 2 0 0

Surprisingly, sulphates and volatile acidity form clusters of the red wine quality levels. The sulphates range [0.6, 0.9] and the volatile acidity range [0.3, 0.6] contain the largest number of high quality wines. This might be very insightful for determining the quality of a red wine, and these two variables might be the ones we are most interested in.

Let’s analyze these two variables (volatile acidity and sulphates) and compare each of them against citric acid, which from quality level 5 upward showed a trend: the higher the citric acid, the higher the quality.
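
To put a number on those ranges, the quality 8 column of the two count tables above can be summed inside and outside each range. This is only a sketch: the counts are copied by hand from the tables, the helper `share_in_range` is a name made up here, and the two shares are computed separately for each variable rather than jointly.

```r
# Quality-8 counts per rounded bin, copied from the tables above.
va_bins  <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.6)
va_q8    <- c(0,   0,   5,   8,   2,   2,   0,   1,   0,   0,   0,   0,   0,   0)
sul_bins <- c(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.6, 2.0)
sul_q8   <- c(0,   0,   0,   3,   7,   5,   2,   0,   1,   0,   0,   0,   0,   0)

# Share of all quality-8 wines whose rounded value falls inside [lo, hi].
share_in_range <- function(bins, counts, lo, hi) {
  sum(counts[bins >= lo & bins <= hi]) / sum(counts)
}

va_share  <- share_in_range(va_bins, va_q8, 0.3, 0.6)
sul_share <- share_in_range(sul_bins, sul_q8, 0.6, 0.9)
print(c(volatile.acidity = va_share, sulphates = sul_share))
```

Because the bins are rounded values, each range is slightly wider than its endpoints suggest.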

We will start with the sulphates and the citric acid plot

Surprisingly, citric acid and sulphates do not cluster the quality levels the way volatile acidity and sulphates did, despite citric acid being negatively correlated with volatile acidity.

Let’s continue the analysis comparing volatile acidity and citric acid.

## [1] -0.6102595

After comparing volatile acidity and citric acid, we can see a moderate negative correlation between these two variables (about -0.61). However, the quality 8 values are very sparse. The level 7 wines seem to be clustered in two blobs, but this is pretty similar to what we saw in the citric acid and sulphates plot. Thus, citric acid might be influencing quality a little, but it is not a major factor. Also, volatile acidity is correlated with citric acid, so if we were to build a model we would rather take volatile acidity and sulphates for now.

Now that we have checked the first three variables that showed clearer trends related to quality, we can move on to our last two variables: alcohol and density. Let’s move forward in the analysis using the alcohol variable and compare it against our two main variables.

As a reminder, alcohol showed a trend in which the higher the alcohol, the higher the quality.

Let’s start comparing alcohol and volatile acidity.

## round(red$alcohol): 8
## 
## 3 4 5 6 7 8 
## 1 0 1 1 0 0 
## -------------------------------------------------------- 
## round(red$alcohol): 9
## 
##   3   4   5   6   7   8 
##   1  12 200  79   2   0 
## -------------------------------------------------------- 
## round(red$alcohol): 10
## 
##   3   4   5   6   7   8 
##   5  22 374 248  35   2 
## -------------------------------------------------------- 
## round(red$alcohol): 11
## 
##   3   4   5   6   7   8 
##   3  14  85 172  58   4 
## -------------------------------------------------------- 
## round(red$alcohol): 12
## 
##   3   4   5   6   7   8 
##   0   4  12 109  81   4 
## -------------------------------------------------------- 
## round(red$alcohol): 13
## 
##  3  4  5  6  7  8 
##  0  1  8 23 18  6 
## -------------------------------------------------------- 
## round(red$alcohol): 14
## 
## 3 4 5 6 7 8 
## 0 0 0 6 5 2 
## -------------------------------------------------------- 
## round(red$alcohol): 15
## 
## 3 4 5 6 7 8 
## 0 0 1 0 0 0

In the plot, it seems that quality tends to be good when both the volatile acidity and the alcohol are low. However, as the alcohol gets higher, the volatile acidity can also go higher and still yield some good quality wines.

The alcohol range for which we get higher qualities goes from 10 to 14 % volume, while the volatile acidity range, as we saw in previous plots, goes from 0.3 to 0.6 \(g/dm^3\).

Now that we have compared alcohol to volatile acidity, let’s move on to compare it to sulphates.

This plot shows that the higher quality wines are in a specific range of sulphates and alcohol values. We can do some counting of wines per quality according to alcohol and sulphates to find the ranges where best quality wines live.

Starting with the alcohol and quality counts:

## round(red$alcohol): 8
## 
## 3 4 5 6 7 8 
## 1 0 1 1 0 0 
## -------------------------------------------------------- 
## round(red$alcohol): 9
## 
##   3   4   5   6   7   8 
##   1  12 200  79   2   0 
## -------------------------------------------------------- 
## round(red$alcohol): 10
## 
##   3   4   5   6   7   8 
##   5  22 374 248  35   2 
## -------------------------------------------------------- 
## round(red$alcohol): 11
## 
##   3   4   5   6   7   8 
##   3  14  85 172  58   4 
## -------------------------------------------------------- 
## round(red$alcohol): 12
## 
##   3   4   5   6   7   8 
##   0   4  12 109  81   4 
## -------------------------------------------------------- 
## round(red$alcohol): 13
## 
##  3  4  5  6  7  8 
##  0  1  8 23 18  6 
## -------------------------------------------------------- 
## round(red$alcohol): 14
## 
## 3 4 5 6 7 8 
## 0 0 0 6 5 2 
## -------------------------------------------------------- 
## round(red$alcohol): 15
## 
## 3 4 5 6 7 8 
## 0 0 1 0 0 0

Then, we move to sulphates and quality counts:

## round(red$sulphates, 1): 0.3
## 
## 3 4 5 6 7 8 
## 0 1 0 0 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.4
## 
##  3  4  5  6  7  8 
##  1  4 38  8  2  0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.5
## 
##   3   4   5   6   7   8 
##   4  17 202  86   7   0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.6
## 
##   3   4   5   6   7   8 
##   4  24 264 255  44   3 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.7
## 
##   3   4   5   6   7   8 
##   0   3  89 126  50   7 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.8
## 
##   3   4   5   6   7   8 
##   0   0  40 100  65   5 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 0.9
## 
##  3  4  5  6  7  8 
##  1  1 14 36 21  2 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1
## 
##  3  4  5  6  7  8 
##  0  0  7 13  5  0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.1
## 
##  3  4  5  6  7  8 
##  0  2 10  5  4  1 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.2
## 
## 3 4 5 6 7 8 
## 0 0 8 3 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.3
## 
## 3 4 5 6 7 8 
## 0 0 5 1 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.4
## 
## 3 4 5 6 7 8 
## 0 0 0 2 1 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 1.6
## 
## 3 4 5 6 7 8 
## 0 0 3 1 0 0 
## -------------------------------------------------------- 
## round(red$sulphates, 1): 2
## 
## 3 4 5 6 7 8 
## 0 1 1 2 0 0

From the plot and statistics, we can see that good wines are more common above 10 % volume alcohol (except at 15 % vol), and the higher the alcohol level, the higher the chances of a grade 8 wine. Also, it seems grade 8 wines mainly appear when sulphates are in the range of 0.6 to 0.9.

Now that we have compared alcohol with sulphates and found no very strong pattern, we can plot alcohol versus citric acid.

The higher quality wines are very sparse, so these variables together might not be a good indicator of quality.

Finally, let’s explore alcohol and density, since they show a correlation in our correlation matrix.

Although alcohol and density are slightly correlated, together they do not explain the high quality wines.

Before drawing conclusions about the multivariate analysis, it might be interesting to see whether the volatile acidity per % volume of alcohol actually helps alcohol play a part in this analysis.

Surprisingly, dividing the volatile acidity by the % volume of alcohol brought the higher quality wines closer together and made them easier to cluster than with the undivided volatile acidity.

We can compare this plot against the first plot on this multivariate analysis if we use some numbers.

## round(red$volatile.acidity/red$alcohol, 2): 0.01
## 
## 3 4 5 6 7 8 
## 0 0 0 4 3 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.02
## 
##  3  4  5  6  7  8 
##  0  1  9 39 40  1 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.03
## 
##   3   4   5   6   7   8 
##   0   4  39 116  80  12 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.04
## 
##   3   4   5   6   7   8 
##   1   3 105 140  30   3 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.05
## 
##   3   4   5   6   7   8 
##   0   6 126 135  24   1 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.06
## 
##   3   4   5   6   7   8 
##   1  11 161 119  18   0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.07
## 
##   3   4   5   6   7   8 
##   1   6 141  49   3   1 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.08
## 
##  3  4  5  6  7  8 
##  2 11 60 26  0  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.09
## 
##  3  4  5  6  7  8 
##  2  4 24  6  1  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.1
## 
##  3  4  5  6  7  8 
##  1  6 11  1  0  0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.11
## 
## 3 4 5 6 7 8 
## 1 0 1 3 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.12
## 
## 3 4 5 6 7 8 
## 0 1 3 0 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.13
## 
## 3 4 5 6 7 8 
## 0 0 1 0 0 0 
## -------------------------------------------------------- 
## round(red$volatile.acidity/red$alcohol, 2): 0.14
## 
## 3 4 5 6 7 8 
## 1 0 0 0 0 0

From this table and what we learned from the table for the first plot of this section, we can see that dividing the volatile acidity by alcohol definitely helped bring the high quality wines together. In fact, in the first plot of volatile acidity versus sulphates, the maximum number of quality 8 wines in a single bin was 8, and it occurred when volatile acidity was 0.4 \(g/dm^3\). On the other hand, when we plotted volatile acidity per alcohol versus sulphates, the maximum number of quality 8 wines was 12, and it occurred when the ratio was 0.03 \(g/dm^3\) per % vol of alcohol. This definitely shows a better cluster for wine quality.
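
The comparison can be written out as a short calculation. In this sketch the quality 8 counts are copied by hand from the two tables in this section, and the variable names are made up for the example.

```r
# Quality-8 counts per bin, copied from the tables in this section.
q8_by_va       <- c(0, 0, 5, 8, 2, 2, 0, 1, 0, 0, 0, 0, 0, 0)   # volatile acidity, rounded to 0.1
q8_by_va_ratio <- c(0, 1, 12, 3, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0)  # volatile acidity / alcohol, rounded to 0.01

# Largest single-bin count under each binning.
peak_va    <- max(q8_by_va)
peak_ratio <- max(q8_by_va_ratio)

# Relative increase of the peak, and the share of all quality-8 wines it holds.
relative_increase <- (peak_ratio - peak_va) / peak_va
peak_shares <- c(va = peak_va, ratio = peak_ratio) / sum(q8_by_va)
print(relative_increase)
print(round(peak_shares, 3))
```

The two binnings are not equally wide relative to each variable's range, so the comparison is indicative rather than exact.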

If we want to go a bit further we can try to measure correlation for cor( volatile acidity, sulphates) and cor( volatile acidity per % volume of alcohol, sulphates).

Note: We only need numbers for the volatile acidity divided by alcohol since sulphates stayed the same.

## [1] -0.325584
## [1] -0.3569725

We can see that the first correlation coefficient is closer to 0, which means volatile acidity and sulphates are less correlated than volatile acidity per alcohol and sulphates.

I still see that they are not strongly correlated, but this is more of a classification problem, so multiple logistic regression would be a better candidate for building a model in the future.
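
One way to compare the two coefficients is by magnitude and by squared value. The sketch below uses the two numbers printed above; the variable names are made up for the example.

```r
# Coefficients copied from the output above.
r_va_sulphates       <- -0.325584    # cor(volatile acidity, sulphates)
r_va_ratio_sulphates <- -0.3569725   # cor(volatile acidity / alcohol, sulphates)

# |r| measures the strength of the linear relation; r^2 is the share of
# variance one variable explains in the other under a linear fit.
strength <- data.frame(
  pair      = c("va vs sulphates", "va/alcohol vs sulphates"),
  abs_r     = abs(c(r_va_sulphates, r_va_ratio_sulphates)),
  r_squared = c(r_va_sulphates, r_va_ratio_sulphates)^2
)
print(strength)
```

Both squared values are small, so the difference between the two pairs is modest in linear terms.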
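
As a hedged sketch of what such a model could look like, the example below fits a binary logistic regression with base R's glm(). It is a simplification under stated assumptions: the data are simulated (not the wine measurements), the rule generating them is invented, and quality is reduced to a hypothetical "good" flag. A model over all quality levels would need a multi-class method, such as multinom() from the nnet package.

```r
set.seed(42)  # reproducible simulation

# Simulated stand-in for the wine data: NOT the real measurements.
n   <- 200
toy <- data.frame(
  volatile.acidity = runif(n, 0.2, 1.0),
  sulphates        = runif(n, 0.4, 1.0)
)

# Invented rule: lower volatile acidity and higher sulphates raise the odds
# of a "good" wine (a hypothetical flag standing in for high quality).
log_odds <- -1 - 4 * (toy$volatile.acidity - 0.5) + 4 * (toy$sulphates - 0.7)
toy$good <- rbinom(n, size = 1, prob = plogis(log_odds))

# Binary logistic regression on the two features.
fit <- glm(good ~ volatile.acidity + sulphates, data = toy, family = binomial)
print(coef(fit))

# Predicted probability of "good" for one hypothetical wine.
new_wine <- data.frame(volatile.acidity = 0.35, sulphates = 0.80)
p_good   <- predict(fit, newdata = new_wine, type = "response")
print(p_good)
```

The fitted coefficients only reflect the invented rule, so they say nothing about real wines; the point is the shape of the workflow.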


Multivariate Analysis

Were there features that strengthened each other in terms of looking at your feature(s) of interest?

It was surprising to find that sulphates in the range of 0.6 to 0.9 \(g/dm^3\) and volatile acidity in the range of 0.3 to 0.6 \(g/dm^3\) contain the highest quality wines. It was also interesting to find that these two chemical properties form clusters of the quality levels. This all makes sense, since quality is a category and we are facing a classification problem.

Were there any interesting or surprising interactions between features?

Finding that volatile acidity and sulphates are very promising for determining the quality of a wine was surprising, but it was even more interesting to find that dividing the volatile acidity by the alcohol and plotting that against the sulphates gave an even better relation between variables for finding the chemicals that may be driving the quality of a wine.

In fact, the largest group of quality 8 wines was 50% greater (12 versus 8 wines) when volatile acidity was divided by alcohol than when it was left alone.

This really exemplified the importance of combining features.


Final Plots and Summary

Plot One

Description One

This was the first plot we made, and it helped us understand how many wines per quality category we have in our dataset. Clearly, we have more quality 5 and 6 wines than any other category. Knowing that allowed us to understand that it is completely fine for the chemical components not to show normal distributions. In other words, we will have many skewed distributions, and these might be the ones to look at to find promising variables that explain the quality of a red wine.
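
The imbalance can also be expressed in numbers. In this sketch the per-level totals are obtained by summing each quality column of the alcohol-by-quality count tables in the Multivariate Plots Section; the variable names are made up for the example.

```r
# Wines per quality level, summed by hand from the alcohol-by-quality
# count tables in the Multivariate Plots Section.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# The total should match the 1,599 records in the dataset.
total  <- sum(quality_counts)
shares <- round(quality_counts / total, 3)
print(shares)

# Combined share of the two dominant levels, 5 and 6.
mid_share <- sum(quality_counts[c("5", "6")]) / total
print(mid_share)
```

Levels 5 and 6 together account for the large majority of records, which is why patterns at the extremes rest on very few wines.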

Plot Two

Description Two

Once we analyzed the variables’ distributions by quality, more details emerged about chemical properties being related to quality levels. The box plots allowed us to discover two important chemical properties: volatile acidity and sulphates. The box plots show that volatile acidity \(g/dm^3\) decreases as the quality of a wine gets higher, while sulphates \(g/dm^3\) increase in higher quality wines. Finding these two variables helped us understand that there are some chemical properties that can explain the quality of a red wine.

Plot Three

Description Three

This plot shows how the different quality levels of a red wine overlap and, at the same time, form clusters. This is very important when identifying the quality of a specific wine, since this is a classification problem. The fact that we see clusters using volatile acidity per alcohol and sulphates means that these chemical properties are good indicators of a red wine’s quality. We can also see some outliers in the data, but mainly we see groups of our quality levels. One can also appreciate the quality 5 and 6 clusters spread all over the plot, which confirms what we learned from our dataset at the very beginning. All in all, this plot can be very useful for seeing the different quality levels and the ranges of \(g/dm^3\) of volatile acidity per % alcohol and of sulphates in which one can find higher quality wines.


Reflection

In this project, I learned a lot about red wine quality and the chemical properties that are measured to give a wine a quality grade.

The hardest part of the analysis was to start understanding the data, find something useful to begin with, and keep unveiling more and more with each new analysis.

While plotting univariate distributions, it was very hard to determine whether a variable might be related to an important finding; using intuition was key, and challenging. Later on, during the bivariate analysis, it was a bit easier to decide which variables to choose based on the box plots and the trends seen with quality, but the possible plots to explore and relations to find were too many to even consider them all. At that point I had the idea to explore correlated variables and start simple, so as to base a strong decision on simple relations rather than very complicated ones. However, I felt I was approaching a dead end whenever a chemical property showed no relation to quality. It was challenging that things that made sense to me, such as alcohol being related to a wine’s quality, did not really hold up when exploring the data. Still, I was successful in finding volatile acidity and sulphates as important variables that seemed to follow a trend related to quality levels.

From there things became a bit easier, and I knew exactly what I wanted to do at the very beginning of the multivariate analysis. In this final analysis, I started by plotting volatile acidity versus sulphates, colored by quality, to see if there were any visible signs of clusters of wine quality. In fact, there were clusters, a bit spread out but clear enough to see. I then hit a dead end in which no other chemical components formed a cluster or explained quality. That was when I decided that my first plot of volatile acidity and sulphates was the best I had gotten so far.

A day later, I still could not believe that alcohol % volume was not related to quality, since wine is alcohol. That led me to divide volatile acidity by alcohol, and I found something that blew my mind once I plotted those variables: the clusters I had found in the plot of volatile acidity versus sulphates were now tighter. This finding was in fact a reward for the hard effort of figuring out whether other variables that logically seemed to be connected to the quality of a wine, especially alcohol, were actually connected.

All in all, I had a great experience analyzing the data, struggling, and figuring out a new way forward when things didn’t look good on the path I was taking. Other times, I just had to explore until I found something useful to move further with. This project definitely taught me how to approach an EDA to find something interesting, and also to trust intuition once in a while.

Future Work

As for future work with these findings, the best next step would be to create a multi-class logistic regression model. I would also be tempted to get more data in order to have a dataset that is more balanced across all categories and that also includes quality levels 0, 1, 2, 9 and 10. This would definitely benefit the model we build, and we might even have to do another quick exploration to determine whether our current chemical properties still explain red wine quality.

Limitations

The limitations of this analysis are related to data imbalance and a possible bias toward the quality 5 and 6 red wines. Also, we explored only one of the relations that seemed strong enough to explain quality; there are many other possibilities, since this is an open-ended problem in which someone else may find different relations that come close to explaining the quality of a red wine. Another limitation is that the dataset contained 12 variables related to a wine, but there might be more, such as how long the wine was left to mature, the type of grapes used, the region the wine comes from, etc. More variables can contribute to determining the quality, and we might be looking at only a small set of them. All in all, be aware that this analysis helps understand quality from the current dataset, but not for every type of wine you can find out there in the world.

Resources

No special resources were used while creating this project other than the R ggplot documentation.